
cp: fix: transformers v5.5.0 validation (2010) into r0.4.0 #2013

Merged: akoumpa merged 1 commit into r0.4.0 from cherry-pick-2010-r0.4.0 on Apr 23, 2026
Conversation

@svcnvidia-nemo-ci (Contributor)

beep boop [🤖]: Hi @akoumpa 👋,

we've cherry picked #2010 into  for you! 🚀

Please review and approve this cherry-pick at your convenience!

* catch StrictDataclassClassValidationError

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* test: limit nightly to stepfun recipe for stepfun CI repro

Temporarily prune nightly_recipes.yml to the single
stepfun/step_3.5_flash_hellaswag_pp.yaml recipe to iterate on the
StrictDataclassClassValidationError fix without paying for the full
nightly matrix. Not intended for merge.

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* fix: retry NeMoAutoTokenizer load when config trips layer_types validator

AutoTokenizer.from_pretrained internally calls AutoConfig.from_pretrained
to resolve the tokenizer class. For checkpoints whose config has
layer_types longer than num_hidden_layers (e.g. stepfun-ai/Step-3.5-Flash),
newer transformers rejects the config and huggingface_hub wraps the
ValueError in StrictDataclassClassValidationError (not a ValueError
subclass). The previous get_hf_config fix only covered the model-load
path; the tokenizer path hit the same failure independently.

On that specific validator failure, preload a config via get_hf_config
(which truncates layer_types) and retry the tokenizer load with an
explicit config=, bypassing the internal AutoConfig call.

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
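The catch-and-retry shape this commit describes can be sketched as below. This is a hypothetical, self-contained illustration: `StrictDataclassClassValidationError` here is a stand-in class (the real one lives in huggingface_hub and is deliberately not a `ValueError` subclass), and the loader and config callables are injected rather than being the real `AutoTokenizer`/`get_hf_config` call sites.

```python
class StrictDataclassClassValidationError(Exception):
    """Stand-in for huggingface_hub's strict-dataclass validation error."""


def load_tokenizer_with_retry(name, from_pretrained, get_hf_config, **kwargs):
    """Retry a tokenizer load with an explicit config= on validator failure."""
    try:
        # First attempt: the loader resolves its config internally, which is
        # where the layer_types validator can fire.
        return from_pretrained(name, **kwargs)
    except StrictDataclassClassValidationError:
        # Preload a config (the real get_hf_config truncates layer_types),
        # then bypass the internal AutoConfig call by passing config=.
        config = get_hf_config(name)
        return from_pretrained(name, config=config, **kwargs)
```

The key point is that the `except` clause names the wrapper exception explicitly, since catching `ValueError` alone would miss it.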

* refactor: relax validate_layer_type globally instead of preloading config

The previous tokenizer retry preloaded a fixed config via get_hf_config
and re-entered AutoTokenizer.from_pretrained with an explicit config=.
That round-trip is brittle (reconstructs a config the tokenizer does not
use) and only fixes the tokenizer call site.

Replace it with relax_layer_types_validator(): a one-shot monkey-patch
that swaps PretrainedConfig.validate_layer_type with a no-op and rewrites
the already-frozen validator entries in every live subclass's
__class_validators__ list. After that, any downstream call that instantiates
a config with mismatched layer_types/num_hidden_layers skips the check.

The tokenizer retry now just applies the patch and re-invokes
super().from_pretrained(...) with the original kwargs.

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
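A minimal sketch of the one-shot patch described above, using stand-in classes: `BaseConfig` mimics `PretrainedConfig`, and `__class_validators__` mimics the list huggingface_hub freezes at class-creation time (which is why patching the base method alone is not enough). Names and structure are illustrative, not the real transformers internals.

```python
class BaseConfig:
    """Stand-in for PretrainedConfig with a strict layer_types validator."""

    def validate_layer_type(self, layer_types, num_hidden_layers):
        if layer_types is not None and len(layer_types) != num_hidden_layers:
            raise ValueError("layer_types length mismatch")


class SubConfig(BaseConfig):
    # Mimics huggingface_hub freezing the validator reference at class
    # creation: this list keeps pointing at the *original* function even
    # after the base attribute is reassigned.
    __class_validators__ = [BaseConfig.validate_layer_type]


_patched = False


def relax_layer_types_validator():
    """One-shot patch: no-op the base validator and rewrite frozen copies."""
    global _patched
    if _patched:
        return
    _patched = True
    original = BaseConfig.validate_layer_type
    noop = lambda self, *args, **kwargs: None
    BaseConfig.validate_layer_type = noop
    # Direct subclasses only, for brevity; the frozen entries in each live
    # subclass's validator list are swapped for the no-op.
    for subclass in BaseConfig.__subclasses__():
        validators = getattr(subclass, "__class_validators__", None)
        if validators:
            subclass.__class_validators__ = [
                noop if v is original else v for v in validators
            ]
```

After the patch, any downstream instantiation that would have tripped the mismatch check simply skips it, which is exactly why the tokenizer retry can re-invoke `super().from_pretrained(...)` with the original kwargs.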

* fix: retry VLM AutoProcessor load on layer_types validation failure

AutoProcessor.from_pretrained internally loads AutoConfig, so configs
whose layer_types length differs from num_hidden_layers trip
validate_layer_type through the processor path too. Previously the
VLM build_dataloader caught the error under a broad except and silently
set processor=None, producing a cryptic downstream failure.

On the specific validator signature, call relax_layer_types_validator()
and retry AutoProcessor.from_pretrained once. Unrelated exceptions keep
the original fall-through to processor=None with a warning. LLM tokenizer
path is already covered via NeMoAutoTokenizer.

Also pass --force-exclude to the ruff pre-commit hooks so the
tests/ exclusion already declared in pyproject.toml takes effect when
pre-commit passes files explicitly.

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
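The processor-path retry described above follows the same pattern: retry once on the specific validator failure, and keep the original warn-and-fall-through for anything else. A hedged, self-contained sketch, with `is_layer_types_failure` and `relax_validator` injected as stand-ins for the real NeMo helpers:

```python
import warnings


def load_processor_with_retry(name, from_pretrained, relax_validator,
                              is_layer_types_failure):
    """Retry a processor load once after relaxing the layer_types check."""
    try:
        return from_pretrained(name)
    except Exception as exc:
        if is_layer_types_failure(exc):
            # Specific validator failure: relax the check and retry once.
            relax_validator()
            return from_pretrained(name)
        # Unrelated failures keep the original fall-through to None,
        # but now with an explicit warning instead of a silent swallow.
        warnings.warn(f"could not load processor for {name}: {exc}")
        return None
```

Matching on the exception's signature rather than `except Exception` alone is what turns the previously cryptic `processor=None` downstream failure into either a successful retry or a visible warning.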

* revert

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* revert

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* fmt

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

---------

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
@svcnvidia-nemo-ci (Contributor, Author)

/ok to test c748500


copy-pr-bot Bot commented Apr 23, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@akoumpa akoumpa merged commit 689e408 into r0.4.0 Apr 23, 2026
53 checks passed
@akoumpa akoumpa deleted the cherry-pick-2010-r0.4.0 branch April 23, 2026 06:14

Labels

cherry-pick, Run CICD, Trigger Testing CICD
